Lait Equitable: A Data-Driven Approach to Fair Trade Milk
Author
Jayesh Smith, Emeline Raimon-Dacunha-Castelle, and Urs Hurni
Published
May 3, 2024
Abstract
The following project focuses on the analysis of the Lait Equitable dataset, which contains information on the production and sale of fair trade milk in Switzerland. The goal is to identify trends and patterns in the data that help us better understand this market. To that end, we apply a variety of data analysis techniques, including exploratory data analysis, data visualization, and statistical modeling, and draw conclusions about the production of fair trade milk in Switzerland.
1 Introduction
1.1 Company Background
Lait Equitable, also known as Fair Milk, is a Swiss initiative aimed at providing fair compensation to milk producers, ensuring they receive a price that covers production costs, specifically 1 CHF per liter of milk. This initiative emerged in response to the challenging conditions facing Swiss dairy production, including plummeting milk prices that do not cover production costs. Through the cooperative “Lait Equitable,” the initiative collects, processes, and distributes milk to various sales points, ensuring producers receive fair compensation.
Key Points:
Foundation: Addressing the dire situation of milk producers in Switzerland due to inadequate compensation.
Operation: Via the “Lait Equitable” cooperative, ensuring fair pay of 1 CHF per liter to producers.
Product Range: Includes milk and possibly other dairy products found in retail stores under the Faireswiss brand.
Membership: Open to all Swiss milk producers, including those specializing in cheese production, e.g., Gruyère.
1.2 Related Work
1.3 Research questions
Sales Analysis Across Manor Stores: To determine the factors contributing to the success of Lait Equitable’s milk in some Manor stores but not in others.
Price Differences: To analyze the price differences between conventional and organic milk and their impact on market demand and supply.
Demand for Organic Milk: To study how the demand for organic milk has evolved in recent years and identify the driving factors behind this trend.
2 Data
Sources
Description
Wrangling/cleaning
Spotting mistakes and missing data (could be part of EDA too)
Listing anomalies and outliers (could be part of EDA too)
2.1 Swiss Producer
TO IMPLEMENT: EMELINE'S PART
2.1.1 Wrangling and Cleaning
Click to show code
library(data.table)
library(readxl)   # for read_excel()

file_path <- "../data/"
df_producteur <- read_excel(paste0(file_path, "lait_cru_producteur.xlsx"), sheet = 1)
df_producteur$date <- as.Date(df_producteur$date)

library(kableExtra)

# Create a tibble with variable descriptions for df_producteur
variable_table <- tibble(
  Variable = c("Date", "prix_bio", "prix_non_bio", "delta", "Delta_pourcent"),
  Description = c(
    "The date when the prices were recorded, in a year-month-day format.",
    "The recorded price of organic milk on the given date.",
    "The recorded price of non-organic milk on the given date.",
    "The absolute difference between the organic and non-organic milk prices.",
    "The percentage difference between the organic and non-organic milk prices."
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed"))
| Variable       | Description                                                                |
|----------------|----------------------------------------------------------------------------|
| Date           | The date when the prices were recorded, in a year-month-day format.        |
| prix_bio       | The recorded price of organic milk on the given date.                      |
| prix_non_bio   | The recorded price of non-organic milk on the given date.                  |
| delta          | The absolute difference between the organic and non-organic milk prices.   |
| Delta_pourcent | The percentage difference between the organic and non-organic milk prices. |
2.1.2 Description
This dataset records the milk price paid to producers over time. It is essential for a macro-level analysis of the milk industry in Switzerland.
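The two derived columns follow directly from the definitions in the variable table. A minimal pandas sketch of the computation (the prices below are made-up illustrative values, not figures from the dataset):

```python
import pandas as pd

# Toy prices (CHF); the real values come from lait_cru_producteur.xlsx
df = pd.DataFrame({
    "prix_bio": [0.80, 0.82],
    "prix_non_bio": [0.60, 0.65],
})

# Absolute difference between organic and non-organic prices
df["delta"] = df["prix_bio"] - df["prix_non_bio"]
# Percentage difference relative to the non-organic price
df["delta_pourcent"] = (df["prix_bio"] - df["prix_non_bio"]) / df["prix_non_bio"] * 100

print(df.round(2))
```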
Have a look:
Click to show code
# Create a cleaned data frame
df_producteur_show <- df_producteur %>%
  mutate(delta = prix_bio - prix_non_bio,
         delta_pourcent = (prix_bio - prix_non_bio) / prix_non_bio * 100) %>%
  select(date, prix_bio, prix_non_bio, delta, delta_pourcent) %>%
  # Round all numeric columns to 2 decimal places
  mutate_if(is.numeric, round, 2)

# Display the cleaned data using reactable
library(reactable)
reactable(
  df_producteur_show,
  highlight = TRUE,           # Highlight rows on hover
  defaultPageSize = 10,       # Display 10 rows per page
  paginationType = "numbers", # Use numbers for page navigation
  searchable = TRUE,          # Make the table searchable
  sortable = TRUE,            # Allow sorting
  resizable = TRUE            # Allow column resizing
)
The dataset from 2023 was meticulously cleaned and standardized to ensure accuracy in our analysis. Initial steps included loading the data from an Excel file and renaming columns to reflect clearer, month-specific sales data for easier readability. Additionally, we corrected city names to maintain consistency across datasets. This included mapping various forms of city names to their standardized counterparts (e.g., ‘Bâle’ to ‘Basel’).
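The city-name standardization described above can be sketched as a dictionary lookup where unmapped names pass through unchanged. The mapping below is a small excerpt for illustration (only ‘Bâle’ → ‘Basel’ is cited above; the other entries mirror the mapping used for the 2022 data):

```python
import pandas as pd

# Excerpt of the city-name mapping; unmapped names are kept as-is
city_name_mapping = {"Bâle": "Basel", "Genève": "Geneva"}

raw = pd.Series(["Bâle", "Genève", "Sion"])
standardized = raw.map(lambda c: city_name_mapping.get(c, c))
print(standardized.tolist())  # ['Basel', 'Geneva', 'Sion']
```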
2.2.1.2 Description
The dataset is small, containing only monthly sales data from the different Manor store locations for the year 2023. It is nevertheless essential for analyzing Lait Equitable’s sales across Manor stores.
Here is a preview of the cleaned and structured data:
Click to show code
# Load the Python data frame
df_sales_2023 <- py$df

# Load necessary libraries
library(tibble)
library(kableExtra)

# Create a tibble with variable descriptions for df_sales_2023
variable_table <- tibble(
  Variable = c("Row Labels", "Monthly Columns (2023-01-01 to 2023-12-01)", "Grand Total"),
  Description = c(
    "Identifies the Manor store location by name.",
    "Each column represents sales figures for a specific month of 2023.",
    "Total sales across all months of 2023 for each location."
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed"))
| Variable                                   | Description                                                       |
|--------------------------------------------|-------------------------------------------------------------------|
| Row Labels                                 | Identifies the Manor store location by name.                      |
| Monthly Columns (2023-01-01 to 2023-12-01) | Each column represents sales figures for a specific month of 2023. |
| Grand Total                                | Total sales across all months of 2023 for each location.          |
Click to show code
# Using the provided column names in df_sales_2023
df_sales_2023_show <- df_sales_2023 %>%
  # Rename the location column to a standard name
  rename(Location = `Row Labels`) %>%
  # Sum the monthly sales columns from Jan 2023 to Dec 2023
  mutate(Total_Sales = rowSums(select(., `Jan 2023`:`Dec 2023`), na.rm = TRUE)) %>%
  select(Location, `Jan 2023`:`Dec 2023`, Total_Sales) %>%
  mutate_if(is.numeric, round, 2)  # Round all numeric columns to 2 decimal places

# Display the data using reactable for an interactive table
reactable(
  df_sales_2023_show,
  highlight = TRUE,           # Highlight rows on hover
  defaultPageSize = 10,       # Display 10 rows per page
  paginationType = "numbers", # Use numbers for page navigation
  searchable = TRUE,          # Make the table searchable
  sortable = TRUE,            # Allow sorting
  resizable = TRUE            # Allow column resizing
)
2.2.2 Dataset Sales 2022
2.2.2.1 Wrangling and Cleaning
Following the methodology established with the 2023 dataset, the 2022 sales data was processed in the same way. Although structurally different, it was also standardized to facilitate comparison and analysis, including renaming columns to ensure uniform location names across both datasets.
Click to show code
import pandas as pd

# Load the data for 2022
file_path_2022 = '../data/sales_2022.xlsx'
df_2022 = pd.read_excel(file_path_2022)

# Standardize city names based on the provided mapping
city_name_mapping = {
    'Bâle': 'Basel',
    'Genève': 'Geneva',
    'Bienne': 'Biel/Bienne',
    'Chavannes': 'Chavannes-de-Bogis',
    'Marin': 'Marin-Epagnier',
    'Vesenaz': 'Vésenaz',
    'Yverdon': 'Yverdon-les-Bains',
    'Saint-Gall Webersbleiche': 'St. Gall'
}

# Rename columns to standardize city names
df_2022.rename(columns=city_name_mapping, inplace=True)

# Pivot the table to get total sales per location for 2022, summing across all products
sales_columns_2022 = [col for col in df_2022.columns
                      if col not in ['Code article', 'Description article', 'Marque',
                                     'Code Fournisseur', 'Description Fournisseur']]
df_2022_total_sales = df_2022[sales_columns_2022].sum().reset_index()
df_2022_total_sales.columns = ['Location', 'Total Sales 2022']
2.2.2.2 Description
The 2022 dataset, unlike the 2023 dataset, includes a variety of products, each recorded with sales figures across different locations. This dataset is notably less complex, focusing on total sales rather than monthly breakdowns, yet provides critical insights into the sales performance of different products.
Click to show code
# Load the 2022 sales data
df_sales_2022 <- py$df_2022

# Load necessary libraries
library(tibble)
library(kableExtra)

# Create a tibble with variable descriptions for df_sales_2022
variable_table <- tibble(
  Variable = c("Code article", "Description article", "Marque",
               "Code Fournisseur", "Description Fournisseur",
               "Location Columns (e.g., Ascona-Delta, Baden, Bâle, etc.)"),
  Description = c(
    "Unique identifier for each product.",
    "Descriptive name of the product.",
    "Brand of the product.",
    "Supplier code.",
    "Supplier name.",
    "Each of these columns represents sales figures for that specific location."
  )
)

# Display the table using kableExtra
variable_table %>%
  kbl() %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed"))
# Extract the total sales for 2023 from the first dataset
df_2023_total_sales = df[['Row Labels', 'Grand Total']].rename(
    columns={'Row Labels': 'Location', 'Grand Total': 'Total Sales 2023'})

# Merge the 2022 and 2023 datasets on Location
merged_sales_data = pd.merge(df_2022_total_sales, df_2023_total_sales,
                             on='Location', how='outer')

# Fill NaN values for locations present in one dataset but not the other
merged_sales_data.fillna(0, inplace=True)
Click to show code
# Load the merged sales data
df_merged_sales <- py$merged_sales_data

# Show it using reactable
reactable(
  df_merged_sales,
  highlight = TRUE,           # Highlight rows on hover
  defaultPageSize = 10,       # Display 10 rows per page
  paginationType = "numbers", # Use numbers for page navigation
  searchable = TRUE,          # Make the table searchable
  sortable = TRUE,            # Allow sorting
  resizable = TRUE            # Allow column resizing
)
The 2022 sales data has been aggregated and standardized for each location. The merged dataset now shows the total sales per location for both 2022 and 2023. This dataset offers a comprehensive view of Lait Equitable’s sales dynamics over two consecutive years, highlighting trends and changes in consumer behavior across different locations.
2.2.3 Political Parties Dataset
2.2.3.1 Wrangling and Cleaning
explain what you did…
Click to show code
# Read sales data from Excel
sales_data <- read_excel("../data/Ventes annuelles.xlsx")

# Read political party data from Excel
party_data <- read_excel("../data/partisPolitiqueManor.xlsx")

# Clean up party_data to match sales_data locations
party_data_cleaned <- party_data %>%
  mutate(Location = gsub(" ", "", Location)) %>%
  filter(Location %in% sales_data$Location)

# Calculate party presence percentages for each location
party_data_cleaned <- party_data_cleaned %>%
  mutate(
    PLR_Presence    = PLR    / (PLR + PS + UDC + Centre + Verts) * 100,
    PS_Presence     = PS     / (PLR + PS + UDC + Centre + Verts) * 100,
    UDC_Presence    = UDC    / (PLR + PS + UDC + Centre + Verts) * 100,
    Centre_Presence = Centre / (PLR + PS + UDC + Centre + Verts) * 100,
    Verts_Presence  = Verts  / (PLR + PS + UDC + Centre + Verts) * 100
  )

# Merge sales_data with party presence data
merged_data <- merge(sales_data, party_data_cleaned, by = "Location")
2.2.3.2 Description
Write kableExtra to describe the dataset here
Click to show code
# Display the merged data using reactable
reactable(
  merged_data,
  highlight = TRUE,           # Highlight rows on hover
  defaultPageSize = 10,       # Display 10 rows per page
  paginationType = "numbers", # Use numbers for page navigation
  searchable = TRUE,          # Make the table searchable
  sortable = TRUE,            # Allow sorting
  resizable = TRUE            # Allow column resizing
)
USE CODE TO DISPLAY DATA (JAYESH)
3 Exploratory data analysis
3.1 Lait Equitable Products
3.1.1 Sales Distribution Across Months
Click to show code
# Remove the 'Grand Total' column and the row labels column
monthly_sales <- df_sales_2023 %>%
  select(-c(`Grand Total`, `Row Labels`))

# Aggregate the sales per month across all locations
total_sales_per_month <- colSums(monthly_sales)

# Create a data frame for plotting
monthly_sales_df <- data.frame(Month = names(total_sales_per_month),
                               Sales = total_sales_per_month)

# Sort the data frame by Sales in descending order
sorted_monthly_sales_df <- monthly_sales_df %>%
  arrange(desc(Sales))

# Plot with ggplot2
ggplot(sorted_monthly_sales_df, aes(x = reorder(Month, -Sales), y = Sales, fill = Month)) +
  geom_bar(stat = "identity", show.legend = FALSE, fill = "#24918d", color = "black") +
  labs(title = "Total Sales by Month Across All Locations (2023)",
       x = "Month", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The graph shows total monthly sales across all locations for Lait Equitable in 2023, ordered from the month with the highest sales, March, down to the weakest months. The least profitable month appears to be July.
Highest Sales in March: The graph starts with March, which shows the highest sales, almost reaching 25,000 units. This suggests that March was a particularly strong month for sales, possibly due to seasonal factors or specific marketing campaigns.
Gradual Decline in Sales: As we move from left to right, there is a general trend of declining sales. After March, the next highest sales are in December, followed by April, May, and so on. This indicates that sales in March were not sustained throughout the year.
Mid-year and End-Year Trends: While the graph is not in chronological order, it shows that some months like December (typically strong due to the holiday season) also performed well, but none reached the peak seen in March.
Lower Sales in the Latter Months Displayed: The months at the right end of the graph, such as June and July, show the lowest sales figures of the year. This could indicate a seasonal dip or other market dynamics affecting these months; one possible explanation is that people are away during the school holiday period.
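Identifying the peak and trough months programmatically amounts to a max/min lookup on the monthly totals. A toy sketch (the figures are illustrative placeholders, not the real 2023 totals):

```python
import pandas as pd

# Toy monthly totals (units sold); the real figures come from the 2023 Manor dataset
monthly_totals = pd.Series(
    {"Mar 2023": 24800, "Dec 2023": 21000, "Jun 2023": 13000, "Jul 2023": 12000}
)

# Peak and trough months by total sales
best, worst = monthly_totals.idxmax(), monthly_totals.idxmin()
print(best, worst)  # Mar 2023 Jul 2023
```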
3.1.2 Sales Distribution Across Locations
Click to show code
# First, remove the 'Grand Total' column if it's included
df <- df_sales_2023[, -ncol(df_sales_2023)]

# Sum sales across all months for each location
total_sales_by_location <- df %>%
  mutate(Total_Sales = rowSums(select(., -`Row Labels`))) %>%
  select(`Row Labels`, Total_Sales)

# Sort the locations by total sales in descending order
sorted_sales_by_location <- total_sales_by_location %>%
  arrange(desc(Total_Sales))

# Plot the data with ggplot2
ggplot(sorted_sales_by_location,
       aes(x = reorder(`Row Labels`, Total_Sales), y = Total_Sales, fill = `Row Labels`)) +
  geom_bar(stat = "identity", show.legend = FALSE, fill = "#24918d", color = "black") +
  labs(title = "Total Sales by Location (2023)", x = "Location", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))  # Rotate x-axis labels for readability
The graph illustrates the total sales by location for Lait Equitable across various stores in 2023, organized from the lowest to the highest sales volume.
Variability in Sales Across Locations: The graph displays a significant variation in sales across different locations. The left side of the graph shows locations with the least sales, starting with Chur, Rapperswil, St. Gall, and progressively increasing towards the right.
Low Sales in Certain Areas: Locations like Chur, Rapperswil, and St. Gall have notably low sales, which could indicate either a lower demand for Lait Equitable’s products in these areas or possibly less effective marketing and distribution strategies.
High Sales in Specific Locations: The right end of the graph, particularly the last five locations, shows a sharp increase in sales. Notably, Vevey, Marin-Epagnier, Sierre and Monthey exhibit high sales, with Monthey being the highest. This might indicate a stronger market presence, better consumer acceptance, or more effective promotional activities in these regions.
Potential Market Strengths and Weaknesses: The graph effectively highlights where Lait Equitable is performing well and where there might be room for improvement. For instance, the high sales in cities like Sierre and Monthey suggest strong market penetration and acceptance.
Strategic Insights: For Lait Equitable, this graph provides crucial data points for understanding which locations might need more focused marketing efforts or adjustments in distribution strategies. Additionally, it could help identify successful strategies in high-performing locations that could be replicated in areas with lower sales.
Click to show code
# Remove the 'Grand Total' column
df <- df_sales_2023[, -ncol(df_sales_2023)]

# Transform the data into long format: one row per location, month, and sales value
long_data <- df %>%
  pivot_longer(cols = -`Row Labels`, names_to = "Month", values_to = "Sales") %>%
  mutate(Location = `Row Labels`)

# Create an interactive boxplot with plotly
# (boxpoints/jitter are trace-level attributes, not marker attributes)
fig <- plot_ly(
  long_data,
  x = ~Location, y = ~Sales, type = 'box',
  boxpoints = "all", jitter = 0.3,
  marker = list(color = "#7e57c2"),
  line = list(color = "#24918d"),
  hoverinfo = 'text',
  text = ~paste('Month:', Month, '<br>Sales:', Sales)
) %>%
  layout(
    title = "Distribution of Monthly Sales Across Locations",
    xaxis = list(title = "Location"),
    yaxis = list(title = "Monthly Sales"),
    showlegend = FALSE
  ) %>%
  config(displayModeBar = FALSE)  # Optional: hide the mode bar

# Display the plot
fig
This graph shows the variability in sales across locations for each month of 2023. The boxplot provides a visual representation of the distribution of sales figures, highlighting the range, median, and outliers for each location.
The outliers correspond to the unusually high or low sales months identified earlier. This confirms the previous analysis and provides a more detailed view of the sales distribution across locations.
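Boxplot whiskers conventionally flag points beyond 1.5 × IQR from the quartiles as outliers. A small sketch of that rule, using made-up monthly sales for a single location (the extreme value stands in for a month like March):

```python
import pandas as pd

# Toy monthly sales for one location; the 2500 value is a deliberate outlier
sales = pd.Series([900, 950, 1000, 980, 1020, 960, 990, 1010, 940, 970, 1005, 2500])

# Standard 1.5 * IQR fences, as used by boxplot whiskers
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [2500]
```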
3.1.3 Top-Performing / Worst-Performing Locations
Click to show code
# Use the 'Grand Total' column to sort the data from top to bottom
df <- df_sales_2023

# Top 5 performing locations by total sales
df %>%
  arrange(desc(`Grand Total`)) %>%
  slice_head(n = 5) %>%
  select(`Row Labels`, `Grand Total`) %>%
  ggplot(aes(x = reorder(`Row Labels`, `Grand Total`), y = `Grand Total`, fill = `Row Labels`)) +
  geom_bar(stat = "identity", show.legend = FALSE, fill = "#33848D", color = "black") +
  labs(title = "Top 5 Performing Locations by Total Sales (2023)",
       x = "Location", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))

# Bottom 5 performing locations by total sales
df %>%
  arrange(`Grand Total`) %>%
  slice_head(n = 5) %>%
  select(`Row Labels`, `Grand Total`) %>%
  ggplot(aes(x = reorder(`Row Labels`, `Grand Total`), y = `Grand Total`, fill = `Row Labels`)) +
  geom_bar(stat = "identity", show.legend = FALSE, fill = "#33848D", color = "black") +
  labs(title = "Bottom 5 Performing Locations by Total Sales (2023)",
       x = "Location", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))
As previously analyzed, the top and bottom performing locations are displayed in the bar charts. The top 5 locations with the highest total sales are shown in the first graph, while the bottom 5 locations with the lowest total sales are displayed in the second graph.
Top-performing locations are Monthey, Sierre, Marin-Epagnier, Vevey, and Sion. Worst-performing locations are Basel, St. Gall, Sargans, Rapperswil, and Chur.
3.1.4 2022 vs 2023
Click to show code
# Plot a bar chart comparing total sales in 2022 and 2023, with transparent bars
df_merged_sales %>%
  ggplot(aes(x = reorder(Location, -`Total Sales 2023`), y = `Total Sales 2023`, fill = "2023")) +
  geom_bar(aes(x = reorder(Location, -`Total Sales 2022`), y = `Total Sales 2022`, fill = "2022"),
           stat = "identity", position = "dodge",
           fill = "#7e57c2", color = "black", alpha = 0.7) +
  geom_bar(stat = "identity", position = "dodge",
           fill = "#33848D", color = "black", alpha = 0.7) +
  labs(title = "Total Sales Comparison Between 2022 and 2023 by Location",
       x = "Location", y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 0.5))  # Rotate x-axis labels for readability
Trend of Decline: A significant number of Manor locations have lower sales figures in 2023 compared to 2022. This trend suggests that Lait Equitable might be facing challenges in these areas, which could include increased competition, changing consumer preferences, or other market dynamics affecting the demand for their products.
Monthey’s Decline: The bar chart shows that Monthey experienced a substantial decrease in sales in 2023 compared to 2022. This would be a key point of concern for Lait Equitable, and understanding why Monthey is underperforming is essential. This could be due to a range of factors, such as local economic conditions, operational challenges, increased competition, or changes in consumer preference within that particular area.
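The year-over-year comparison behind these observations can be made explicit by computing a percentage change per location. A sketch with made-up totals (the real figures live in the merged 2022/2023 dataset):

```python
import pandas as pd

# Toy totals per location; the real values come from merged_sales_data
merged = pd.DataFrame({
    "Location": ["Monthey", "Vevey", "Chur"],
    "Total Sales 2022": [30000.0, 15000.0, 2000.0],
    "Total Sales 2023": [24000.0, 15600.0, 1800.0],
})

# Percentage change from 2022 to 2023, negative values indicate a decline
merged["YoY %"] = ((merged["Total Sales 2023"] - merged["Total Sales 2022"])
                   / merged["Total Sales 2022"] * 100).round(1)

# Worst-declining locations first
print(merged.sort_values("YoY %").to_string(index=False))
```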
3.1.5 Map
Click to show code
import time
import folium
import pandas as pd
import branca.colormap as cmp
from geopy.geocoders import Nominatim

# Function to calculate the dynamic radius
def calculate_radius(volume, max_volume, min_volume, max_radius=20):
    normalized_volume = (volume - min_volume) / (max_volume - min_volume)
    return normalized_volume * max_radius + 3

# Function to get latitude and longitude
def get_lat_lon(city):
    try:
        time.sleep(1)  # Simple rate-limiting mechanism
        location = geolocator.geocode(city + ', Switzerland')
        return location.latitude, location.longitude
    except AttributeError:
        return None, None

# Read data from different product categories
file_paths = {
    'All Products': ("../data/Produits laitiers équitables - 2023.xlsb", 'Par SM'),
    'Milk Drink': ("../data/lait_drink_sales_per_stores_2023.xlsx", 'Sheet1'),
    'Milk Entier': ("../data/lait_entier_sales_per_stores_2023.xlsx", 'Sheet1'),
    'Fondue': ("../data/fondue_sales_per_stores_2023.xlsx", 'Sheet1'),
    'Delice': ("../data/delice_sales_per_stores_2023.xlsx", 'Sheet1'),
    'Creme': ("../data/creme_cafe_sales_per_stores_2023.xlsx", 'Sheet1')
}

# Create a folium map centred on Switzerland
m = folium.Map(location=[46.8182, 8.2275], zoom_start=8)

# Instantiate the geolocator
geolocator = Nominatim(user_agent="le_stores")

# Loop through each category
for category, (file_path, sheet_name) in file_paths.items():
    engine = 'pyxlsb' if 'xlsb' in file_path else None
    df = pd.read_excel(file_path, engine=engine, sheet_name=sheet_name)

    if category == 'All Products':
        # Skip the first six rows and rename columns based on the provided structure
        df = df.iloc[6:]
        df.rename(columns={
            'Quantités vendues - année 2023': 'City',
            'Unnamed: 1': '01/01/2023',
            'Unnamed: 2': '02/01/2023',
            'Unnamed: 3': '03/01/2023',
            'Unnamed: 4': '04/01/2023',
            'Unnamed: 5': '05/01/2023',
            'Unnamed: 6': '06/01/2023',
            'Unnamed: 7': '07/01/2023',
            'Unnamed: 8': '08/01/2023',
            'Unnamed: 9': '09/01/2023',
            'Unnamed: 10': '10/01/2023',
            'Unnamed: 11': '11/01/2023',
            'Unnamed: 12': '12/01/2023',
            'Unnamed: 13': 'Total General'
        }, inplace=True)
    else:
        # Rename the first and last columns of the XLSX files
        df.rename(columns={df.columns[0]: 'City',
                           df.columns[-1]: 'Total General'}, inplace=True)

    # Standardize city names
    correct_city_names = {
        'Bâle': 'Basel',
        'Genève': 'Geneva',
        'Bienne': 'Biel/Bienne',
        'Chavannes': 'Chavannes-de-Bogis',
        'Marin': 'Marin-Epagnier',
        'Vesenaz': 'Vésenaz',
        'Yverdon': 'Yverdon-les-Bains',
        'Saint-Gall Webersbleiche': 'St. Gall'
    }
    df['City'] = df['City'].apply(lambda x: correct_city_names.get(x, x))

    # Get latitudes and longitudes
    df[['Lat', 'Lon']] = df.apply(lambda row: pd.Series(get_lat_lon(row['City'])), axis=1)

    # Define color scale and feature group
    max_sales = df['Total General'].max()
    min_sales = df['Total General'].min()
    color_scale = cmp.linear.viridis.scale(min_sales, max_sales)
    fg = folium.FeatureGroup(name=category)

    # Add one circle marker per geocoded city, sized by sales volume
    for index, row in df.iterrows():
        if pd.notnull(row['Lat']) and pd.notnull(row['Lon']):
            radius = calculate_radius(row['Total General'], max_sales, min_sales)
            folium.CircleMarker(
                location=[row['Lat'], row['Lon']],
                radius=radius,
                popup=f"{row['City']}: {row['Total General']}",
                color=color_scale(row['Total General']),
                fill=True,
                fill_color=color_scale(row['Total General'])
            ).add_to(fg)
    fg.add_to(m)
<folium.vector_layers.CircleMarker object at 0x00000216237EB080>#> <folium.vector_layers.CircleMarker object at 0x00000216237EB6E0>#> <folium.vector_layers.CircleMarker object at 0x00000216237EBA40>#> <folium.vector_layers.CircleMarker object at 0x00000216237EB8F0>#> <folium.map.FeatureGroup object at 0x00000216237E99A0>#> <folium.vector_layers.CircleMarker object at 0x00000216226053A0>#> <folium.vector_layers.CircleMarker object at 0x0000021622606B70>#> <folium.vector_layers.CircleMarker object at 0x0000021622606450>#> <folium.vector_layers.CircleMarker object at 0x0000021622605BE0>#> <folium.vector_layers.CircleMarker object at 0x0000021622604290>#> <folium.vector_layers.CircleMarker object at 0x0000021622607230>#> <folium.vector_layers.CircleMarker object at 0x0000021622606AB0>#> <folium.vector_layers.CircleMarker object at 0x0000021622604470>#> <folium.vector_layers.CircleMarker object at 0x0000021622604710>#> <folium.vector_layers.CircleMarker object at 0x00000216226045F0>#> <folium.vector_layers.CircleMarker object at 0x00000216226071D0>#> <folium.vector_layers.CircleMarker object at 0x00000216226078F0>#> <folium.vector_layers.CircleMarker object at 0x0000021622606960>#> <folium.vector_layers.CircleMarker object at 0x0000021622606D50>#> <folium.vector_layers.CircleMarker object at 0x0000021622606A80>#> <folium.vector_layers.CircleMarker object at 0x00000216226068A0>#> <folium.vector_layers.CircleMarker object at 0x0000021622607F50>#> <folium.vector_layers.CircleMarker object at 0x0000021622606B10>#> <folium.vector_layers.CircleMarker object at 0x0000021622606A20>#> <folium.vector_layers.CircleMarker object at 0x0000021622606240>#> <folium.vector_layers.CircleMarker object at 0x0000021622606000>#> <folium.vector_layers.CircleMarker object at 0x0000021622605EE0>#> <folium.vector_layers.CircleMarker object at 0x0000021622605AC0>#> <folium.vector_layers.CircleMarker object at 0x0000021622606420>#> <folium.vector_layers.CircleMarker object at 0x0000021622606060>#> 
<folium.vector_layers.CircleMarker object at 0x0000021622605B20>#> <folium.vector_layers.CircleMarker object at 0x00000216226052E0>#> <folium.map.FeatureGroup object at 0x0000021622604740># Add layer control and save the mapfolium.LayerControl().add_to(m)#> <folium.map.LayerControl object at 0x000002162373BD10>m.save('combined_product_map.html')m
Make this Notebook Trusted to load map: File -> Trust Notebook
3.2 Price To Producers Lait Cru
3.2.1 Organic Milk vs Non Organic (bio) Milk
Click to show code
# Create xts object
prices_xts <- xts(df_producteur[, c("prix_bio", "prix_non_bio")], order.by = df_producteur$date)

# Plot using dygraphs
dygraph(prices_xts, main = "Trends in Milk Prices (Organic vs. Non-Organic)", width = "600px", height = "400px") %>%
  dySeries("prix_bio", label = "Organic Price", color = "#24918d") %>%
  dySeries("prix_non_bio", label = "Non-Organic Price", color = "#7e57c2") %>%
  dyOptions(stackedGraph = FALSE) %>%
  dyRangeSelector(height = 20)
Click to show code
# Create an xts object for the delta series, ensuring the series name is retained
delta_xts <- xts(x = df_producteur[, "delta", drop = FALSE], order.by = df_producteur$date)

# Plot using dygraphs (assign to p_delta so the print below matches the object name)
p_delta <- dygraph(delta_xts, main = "Difference in Prices Between Organic and Non-Organic Milk Over Time", width = "600px", height = "400px") %>%
  dySeries("delta", label = "Delta in Price", color = "#24918d") %>%
  dyOptions(stackedGraph = FALSE) %>%
  dyRangeSelector(height = 20)

# Print the dygraph to display it
p_delta
3.2.2 Seasonality
Click to show code
# Process the data to extract month and year
df_producteur <- df_producteur %>%
  mutate(Month = format(date, "%m"),
         Year = format(date, "%Y")) %>%
  arrange(date)  # Ensure data is in chronological order

# Plot the within-year trend with ggplot2.
# Note: `linewidth` replaces the `size` aesthetic deprecated in ggplot2 3.4.0,
# and the loess span is widened because each year holds only 12 monthly points
# (a span of 0.3 leaves too few data values per fit and the smoother fails)
p_seaso_2 <- ggplot(df_producteur, aes(x = Month, y = prix_bio, group = Year, color = as.factor(Year))) +
  geom_smooth(se = FALSE, method = "loess", span = 0.75, linewidth = 0.7) +
  labs(title = "Monthly Milk Prices by Year",
       x = "Month",
       y = "Price of Organic Milk",
       color = "Year") +
  theme_minimal() +
  scale_color_viridis_d() +
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1))

# Convert to an interactive plotly object
interactive_plot_seaso_2 <- ggplotly(p_seaso_2, width = 600, height = 400)

# Adjust plotly settings
interactive_plot_seaso_2 <- interactive_plot_seaso_2 %>%
  layout(margin = list(l = 40, r = 10, b = 40, t = 40),  # Adjust margins
         legend = list(orientation = "h", x = 0, xanchor = "left", y = -0.2))  # Adjust legend position

# Display the interactive plot
interactive_plot_seaso_2
4 Analysis
Answers to the research questions
Different methods considered
Competing approaches
Justifications
4.1 Forecasting Next Year Milk Prices
Click to show code
# Re-arrange the df_producteur data in ascending date order
df_producteur <- df_producteur[order(df_producteur$date), ]

# Create ts objects for organic and non-organic milk prices
df_producteur_ts_non_bio <- ts(df_producteur$prix_non_bio, start = c(2017, 12), frequency = 12)
df_producteur_ts_bio <- ts(df_producteur$prix_bio, start = c(2017, 12), frequency = 12)

# Convert the ts objects to tsibble objects
df_producteur_ts_non_bio <- as_tsibble(df_producteur_ts_non_bio)
df_producteur_ts_bio <- as_tsibble(df_producteur_ts_bio)
4.1.1 Naive Forecast
Click to show code
# Fit a naive model
fit_non_bio <- df_producteur_ts_non_bio %>% model(naive = NAIVE(value))
fit_bio <- df_producteur_ts_bio %>% model(naive = NAIVE(value))

# Forecast the next 12 months
naive_forecast_non_bio <- fit_non_bio %>% forecast(h = 12)
naive_forecast_bio <- fit_bio %>% forecast(h = 12)

plot <- naive_forecast_non_bio %>%
  autoplot(df_producteur_ts_non_bio, alpha = 0.5) +
  labs(title = "Naive Forecast of Non-Organic Milk Prices",
       x = "Date",
       y = "Price") +
  guides(colour = guide_legend(title = "Forecast"))
plot

plot <- naive_forecast_bio %>%
  autoplot(df_producteur_ts_bio, alpha = 0.5) +
  labs(title = "Naive Forecast of Organic Milk Prices",  # corrected title: this plot shows the organic series
       x = "Date",
       y = "Price") +
  guides(colour = guide_legend(title = "Forecast"))
plot
This model is necessarily crude: a naive forecast simply assumes that every future value equals the last observed value. It nevertheless gives us a baseline against which to compare more sophisticated models.
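The last-value rule behind this baseline is easy to state outside R as well. A minimal sketch in Python, using illustrative toy prices rather than the actual producer data:

```python
# Naive (last-value) forecast: every future value equals the last observation.
# The prices below are illustrative only, not the Lait Equitable data.
def naive_forecast(series, h):
    """Forecast h steps ahead by repeating the final observed value."""
    if not series:
        raise ValueError("series must be non-empty")
    return [series[-1]] * h

toy_prices = [0.62, 0.64, 0.63, 0.66, 0.65]
print(naive_forecast(toy_prices, 3))  # [0.65, 0.65, 0.65]
```

Because the naive forecast ignores trend and seasonality entirely, any model worth keeping should beat it on held-out data.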
4.1.2 ARIMA Model
4.1.2.1 Stationarity
Click to show code
# Re-arrange the df_producteur data in ascending date order
df_producteur <- df_producteur[order(df_producteur$date), ]

# Create ts objects for organic and non-organic milk prices
df_producteur_ts_non_bio <- ts(df_producteur$prix_non_bio, start = c(2017, 12), frequency = 12)
df_producteur_ts_bio <- ts(df_producteur$prix_bio, start = c(2017, 12), frequency = 12)

# Check for stationarity
adf.test(df_producteur_ts_non_bio)
#> Augmented Dickey-Fuller Test
#> data:  df_producteur_ts_non_bio
#> Dickey-Fuller = -3, Lag order = 4, p-value = 0.08
#> alternative hypothesis: stationary

adf.test(df_producteur_ts_bio)
#> Augmented Dickey-Fuller Test
#> data:  df_producteur_ts_bio
#> Dickey-Fuller = -3, Lag order = 4, p-value = 0.08
#> alternative hypothesis: stationary
We analyse the stationarity of both price series using the Augmented Dickey-Fuller (ADF) test, which checks whether a unit root is present in a time series. A unit root indicates non-stationarity: statistical properties such as the mean and variance change over time. Rejecting the null hypothesis of the ADF test therefore indicates that the series is stationary.
We do that because ARIMA models require the time series data to be stationary.
The results of the ADF test for both time series df_producteur_ts_non_bio and df_producteur_ts_bio indicate:
Dickey-Fuller statistic value of -3
Lag order of 4
p-value of 0.08
With a p-value of 0.08, above the conventional 0.05 significance level, we cannot reject the null hypothesis of a unit root for either df_producteur_ts_non_bio or df_producteur_ts_bio. The series may be stationary, but the evidence is inconclusive.
We therefore difference the data to make it stationary.
Click to show code
# Difference the time series
df_producteur_ts_non_bio_diff <- diff(df_producteur_ts_non_bio)
df_producteur_ts_bio_diff <- diff(df_producteur_ts_bio)

# Plot the differenced series (titles corrected to match each series)
autoplot(df_producteur_ts_non_bio_diff) + labs(title = "Differenced Time Series of Non-Organic Milk Prices")
autoplot(df_producteur_ts_bio_diff) + labs(title = "Differenced Time Series of Organic Milk Prices")

# Check for stationarity again
adf.test(df_producteur_ts_non_bio_diff)
#> Warning: p-value smaller than printed p-value
#> Augmented Dickey-Fuller Test
#> data:  df_producteur_ts_non_bio_diff
#> Dickey-Fuller = -6, Lag order = 4, p-value = 0.01
#> alternative hypothesis: stationary

adf.test(df_producteur_ts_bio_diff)
#> Warning: p-value smaller than printed p-value
#> Augmented Dickey-Fuller Test
#> data:  df_producteur_ts_bio_diff
#> Dickey-Fuller = -6, Lag order = 4, p-value = 0.01
#> alternative hypothesis: stationary
The results of the ADF test for both differenced time series indicate:
Dickey-Fuller statistic value of -6
Lag order of 4
p-value of 0.01
In this case, the p-value is smaller than the typical significance level of 0.05, indicating strong evidence against the null hypothesis. Therefore, based on the p-values provided (0.01), we can conclude that the differenced time series data are likely stationary.
This suggests that after differencing, the time series data df_producteur_ts_non_bio and df_producteur_ts_bio have become stationary, which is often desirable for various time series analysis techniques and forecasting models.
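The effect of differencing is easy to see on a toy example: first differences turn a linear trend into a constant series, which is exactly the kind of transformation that removes a trend-driven unit root. A minimal sketch with illustrative numbers, not the milk prices:

```python
# First differencing: y'_t = y_t - y_{t-1}.
# A linear trend becomes a constant series, i.e. stationary in the mean.
def difference(series):
    return [b - a for a, b in zip(series, series[1:])]

trend = [2 * t + 1 for t in range(6)]  # 1, 3, 5, 7, 9, 11
print(difference(trend))               # [2, 2, 2, 2, 2]
```

Note that each round of differencing shortens the series by one observation, which is why the "I" order in ARIMA is kept as small as possible.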
4.1.2.2 Fitting the ARIMA Model and Forecasting
Click to show code
# Fit the ARIMA model
fit_non_bio <- auto.arima(df_producteur_ts_non_bio, seasonal = TRUE)
fit_bio <- auto.arima(df_producteur_ts_bio, seasonal = TRUE)

# Forecast the next 12 months
forecast_non_bio <- forecast(fit_non_bio, h = 12)
forecast_bio <- forecast(fit_bio, h = 12)

# Show the components used for the ARIMA model
fit_non_bio %>% summary()
#> Series: df_producteur_ts_non_bio
#> ARIMA(0,1,0)(2,1,0)[12]
#>
#> Coefficients:
#>         sar1    sar2
#>       -0.581  -0.404
#> s.e.   0.149   0.142
#>
#> sigma^2 = 0.643:  log likelihood = -78.9
#> AIC=164   AICc=164   BIC=170
#>
#> Training set error measures:
#>                  ME RMSE   MAE    MPE  MAPE  MASE   ACF1
#> Training set 0.0107 0.72 0.494 0.0226 0.716 0.177 0.0927

fit_bio %>% summary()
#> Series: df_producteur_ts_bio
#> ARIMA(0,1,0)(0,1,1)[12]
#>
#> Coefficients:
#>         sma1
#>       -0.658
#> s.e.   0.161
#>
#> sigma^2 = 1.29:  log likelihood = -102
#> AIC=207   AICc=208   BIC=212
#>
#> Training set error measures:
#>                   ME RMSE  MAE     MPE  MAPE  MASE    ACF1
#> Training set -0.0205 1.03 0.69 -0.0284 0.802 0.251 -0.0121

# Plot the forecasted values
autoplot(forecast_non_bio) + labs(title = "Forecasted Prices of Non-Organic Milk")
Click to show code
autoplot(forecast_bio) +labs(title ="Forecasted Prices of Organic Milk")
4.1.2.3 Fit a SARIMA Model
Click to show code
# Fit the SARIMA model (exhaustive search: no stepwise shortcut, no approximation)
fit_non_bio_sarima <- auto.arima(df_producteur_ts_non_bio, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
fit_bio_sarima <- auto.arima(df_producteur_ts_bio, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)

# Forecast the next 12 months
forecast_non_bio_sarima <- forecast(fit_non_bio_sarima, h = 12)
forecast_bio_sarima <- forecast(fit_bio_sarima, h = 12)

# Plot the forecasted values
autoplot(forecast_non_bio_sarima) + labs(title = "Forecasted Prices of Non-Organic Milk (SARIMA)")
autoplot(forecast_bio_sarima) + labs(title = "Forecasted Prices of Organic Milk (SARIMA)")
4.1.2.4 Compare ARIMA and SARIMA forecast
4.1.2.4.1 Organic Milk SARIMA vs ARIMA
Click to show code
# Compare the fitted ARIMA and SARIMA models for organic milk using AIC
# (lower AIC indicates the better fit; AIC() works on the fitted models, not the forecasts)
AIC(fit_bio)
AIC(fit_bio_sarima)
4.1.2.4.2 Non-Organic Milk SARIMA vs ARIMA
Click to show code
# Compare the fitted ARIMA and SARIMA models for non-organic milk using AIC
# (lower AIC indicates the better fit)
AIC(fit_non_bio)
AIC(fit_non_bio_sarima)
4.1.2.5 Forecasted Prices ARIMA
Click to show code
# Create a table of the forecasted values
forecast_table_arima <- tibble(
  Month = seq(as.Date("2023-01-01"), by = "month", length.out = 12),
  Non_Organic_Forecast = forecast_non_bio$mean,
  Bio_Forecast = forecast_bio$mean
)

# Round the forecasted values
forecast_table_arima <- forecast_table_arima %>%
  mutate(across(c(Non_Organic_Forecast, Bio_Forecast), ~ round(., 2)))

# Show the forecasted values using reactable
reactable(
  forecast_table_arima,
  highlight = TRUE,            # Highlight rows on hover
  defaultPageSize = 10,        # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,           # Make the table searchable
  sortable = TRUE,             # Allow sorting
  resizable = TRUE             # Allow column resizing
)
Click to show code
# Plot the forecasted values
forecast_table_arima %>%
  pivot_longer(cols = c(Non_Organic_Forecast, Bio_Forecast), names_to = "Type", values_to = "Forecasted_Price") %>%
  ggplot(aes(x = Month, y = Forecasted_Price, color = Type)) +
  geom_line() +
  labs(title = "Forecasted Prices of Organic and Non-Organic Milk",
       x = "Month",
       y = "Price",
       color = "Type") +
  theme_minimal()
We used the mean values of the forecasted prices for both organic and non-organic milk to create a table and plot the forecasted prices for the next 12 months. The table provides a detailed view of the forecasted prices, while the plot visualizes the trend of the forecasted prices over time.
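The table and plot report only the mean of each forecast distribution; the forecast objects also carry prediction intervals, and for models built on a random walk (the ARIMA(0,1,0) non-seasonal backbone selected above) the h-step-ahead forecast standard deviation grows roughly like σ√h. A small sketch of that widening, with an illustrative residual standard deviation that is not estimated from the data:

```python
import math

def interval_half_width(h, sigma, z=1.96):
    """Approximate 95% prediction-interval half-width for a random walk,
    h steps ahead: z * sigma * sqrt(h)."""
    return z * sigma * math.sqrt(h)

sigma = 0.8  # illustrative one-step residual standard deviation (assumption)
for h in (1, 6, 12):
    # half-widths widen with the horizon, so 12-month-ahead point
    # forecasts are far less certain than the line plot suggests
    print(h, round(interval_half_width(h, sigma), 2))
```

This is why the shaded bands in the autoplot() fan charts above flare out toward the end of the forecast horizon.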
4.1.2.6 Forecasted Prices SARIMA
Click to show code
# Create a table of the forecasted values
forecast_table_sarima <- tibble(
  Month = seq(as.Date("2023-01-01"), by = "month", length.out = 12),
  Non_Organic_Forecast = forecast_non_bio_sarima$mean,
  Bio_Forecast = forecast_bio_sarima$mean
)

# Round the forecasted values
forecast_table_sarima <- forecast_table_sarima %>%
  mutate(across(c(Non_Organic_Forecast, Bio_Forecast), ~ round(., 2)))

# Show the forecasted values using reactable
reactable(
  forecast_table_sarima,
  highlight = TRUE,            # Highlight rows on hover
  defaultPageSize = 10,        # Display 10 rows per page
  paginationType = "numbers",  # Use numbers for page navigation
  searchable = TRUE,           # Make the table searchable
  sortable = TRUE,             # Allow sorting
  resizable = TRUE             # Allow column resizing
)
Click to show code
# Plot the forecasted values
forecast_table_sarima %>%
  pivot_longer(cols = c(Non_Organic_Forecast, Bio_Forecast), names_to = "Type", values_to = "Forecasted_Price") %>%
  ggplot(aes(x = Month, y = Forecasted_Price, color = Type)) +
  geom_line() +
  labs(title = "Forecasted Prices of Organic and Non-Organic Milk",
       x = "Month",
       y = "Price",
       color = "Type") +
  theme_minimal()
4.1.3 Exponential Smoothing
Click to show code
# Fit the ETS model
fit_non_bio_ets <- ets(df_producteur_ts_non_bio)
fit_bio_ets <- ets(df_producteur_ts_bio)

# Forecast the next 12 months
forecast_non_bio_ets <- forecast(fit_non_bio_ets, h = 12)
forecast_bio_ets <- forecast(fit_bio_ets, h = 12)

# Plot the forecasted values
autoplot(forecast_non_bio_ets) + labs(title = "Forecasted Prices of Non-Organic Milk (ETS)")
autoplot(forecast_bio_ets) + labs(title = "Forecasted Prices of Organic Milk (ETS)")
The Pareto Principle, often known as the 80/20 rule, asserts that a small proportion of causes, inputs, or efforts usually lead to a majority of the results, outputs, or rewards. Applied to a business context where approximately 20% of the sales account for 80% of the revenues, this principle can help in identifying and focusing on the most profitable aspects of a business.
Evidence from Research:
Sales and Customer Concentration: Research has consistently shown that a significant portion of sales often comes from a minority of customers or products. For instance, an analysis across 22 consumer packaged goods categories found an average Pareto ratio of 0.73, indicating that the top products and customers account for a disproportionately high share of sales or profits (Kim, Singh, & Winer, 2017).
Decision Making and Resource Allocation: The Pareto Principle aids decision-making by highlighting where the greatest impact can be achieved. For example, focusing on top-performing products or customers can optimize resource allocation and maximize profits (Ivančić, 2014).
Market and Profit Concentration: Another study noted that a small number of customers are often responsible for a large portion of sales, which supports a strategic focus on these customers to boost profitability and efficiency (McCarthy & Winer, 2018).
Conclusion: Applying the Pareto Principle in a business context where a minority of sales drives the majority of revenue can lead to more focused and effective business strategies, optimizing efforts towards the most profitable segments. This approach not only simplifies decision-making but also enhances resource allocation, ultimately leading to increased profitability.
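The concentration measure used in the analysis below can be sketched as a small function: sort units by sales, accumulate, and report the fraction of units needed to reach a target share of total sales. The figures here are toy values, not the Manor data:

```python
def pareto_share(sales, target=0.80):
    """Fraction of units (sorted by sales, descending) needed to reach
    `target` share of total sales."""
    ordered = sorted(sales, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return count / len(ordered)
    return 1.0

toy_sales = [50, 30, 10, 5, 3, 2]         # total = 100
print(round(pareto_share(toy_sales), 3))  # 0.333: 2 of 6 units reach 80%
```

A classic 80/20 distribution would return roughly 0.20; values closer to 0.33, as in this toy example, indicate a milder but still real concentration of sales.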
4.2.1.1 Steps
Calculating the total sales across all locations for both 2022 and 2023.
Ranking locations by sales to see the cumulative contribution of each location towards the total.
Identifying the point where approximately 20% of the locations contribute to around 80% of the sales.
Click to show code
# Calculate the total sales for each year and the combined total to apply the Pareto Principle
merged_sales_data['Combined Sales'] = merged_sales_data['Total Sales 2022'] + merged_sales_data['Total Sales 2023']

# Sort locations by combined sales
pareto_data = merged_sales_data.sort_values(by='Combined Sales', ascending=False)

# Calculate cumulative sales
pareto_data['Cumulative Sales'] = pareto_data['Combined Sales'].cumsum()

# Calculate the total of combined sales
total_combined_sales = pareto_data['Combined Sales'].sum()

# Calculate the percentage of cumulative sales
pareto_data['Cumulative Percentage'] = 100 * pareto_data['Cumulative Sales'] / total_combined_sales

# Find the point where about 20% of the locations contribute approximately 80% of the sales
pareto_data['Location Count'] = range(1, len(pareto_data) + 1)
pareto_data['Location Percentage'] = 100 * pareto_data['Location Count'] / len(pareto_data)

# Plot the Pareto curve
plt.figure(figsize=(12, 8))
plt.plot(pareto_data['Location Percentage'], pareto_data['Cumulative Percentage'],
         label='Cumulative Percentage of Sales', color='b', marker='o')
plt.axhline(80.2, color='r', linestyle='dashed', linewidth=1)
plt.axvline(33.3, color='green', linestyle='dashed', linewidth=1)
plt.title('Pareto Analysis of Sales Across Locations')
plt.xlabel('Cumulative Percentage of Locations')
plt.ylabel('Cumulative Percentage of Sales')
plt.legend()
plt.grid(True)
plt.show()
The graph shows that roughly 33.3% of Manor locations contribute 80% of sales. This deviates from the classic 80/20 Pareto split, but it still reveals a clear concentration of sales within a subset of stores.
4.2.1.2 Observations
We will identify the top 33.3% of locations based on their cumulative sales contribution. This means selecting the smallest number of locations that together account for at least 80% of the total sales.
The top-performing 33.3% of Manor locations that contribute the majority of sales are:
Click to show code
# Calculate the threshold for the top 33.3% of locations
top_third_index = int(len(pareto_data) * 0.34)

# Identify the top 33.3% of stores contributing at least 80% of sales
top_performing_stores = pareto_data.head(top_third_index)
top_performing_stores
#>               Location  Total Sales 2022  ...  Location Count  Location Percentage
#> 14             Monthey             49965  ...               1             3.703704
#> 20              Sierre             32212  ...               2             7.407407
#> 13      Marin-Epagnier             30228  ...               3            11.111111
#> 23               Vevey             25720  ...               4            14.814815
#> 10              Geneva             16068  ...               5            18.518519
#> 21                Sion             15208  ...               6            22.222222
#> 9             Fribourg             14792  ...               7            25.925926
#> 11            Lausanne             13186  ...               8            29.629630
#> 5   Chavannes-de-Bogis             14359  ...               9            33.333333
#>
#> [9 rows x 8 columns]
4.2.2 Understanding Success Factors of Top-Performing Stores
4.2.2.1 Correlating Political Parties with Milk Sales
Click to show code
# Calculate correlation coefficients for each party
correlation_df <- data.frame(
  Party = c("PLR", "PS", "UDC", "Centre", "Verts"),
  Correlation = sapply(merged_data[, 4:8], function(x) cor(x, merged_data$`2023`))
)

# Print the correlation coefficients
print(correlation_df)
#>         Party Correlation
#> PLR       PLR      0.2796
#> PS         PS      0.1853
#> UDC       UDC     -0.1933
#> Centre Centre      0.0375
#> Verts   Verts      0.1708

# Create a matrix of plots for each party.
# Note: tidy evaluation with .data[[...]] replaces aes_string(), deprecated in ggplot2 3.0.0
party_plots <- lapply(names(merged_data)[4:8], function(party) {
  ggplot(merged_data, aes(x = .data[["2023"]], y = .data[[party]])) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, color = "blue") +
    labs(x = "Annual Sales",
         y = paste(party, "Party Presence (%)"),
         title = paste("Correlation:", party, "Party vs. Sales")) +
    theme_minimal()
})

# Arrange the plots in a matrix layout (grid.arrange draws the grid directly)
matrix_plot <- gridExtra::grid.arrange(grobs = party_plots, ncol = 2)